Statistical properties of sampled networks by random walks 






Sooyeon Yoon, Sungmin Lee, Soon-Hyung Yook, and Yup Kim

Department of Physics and Research Institute for Basic Sciences, Kyung Hee University, Seoul 130-701, Korea 

(Dated: February 2, 2008) 

We study the statistical properties of networks sampled by a random walker.
We compare topological properties of the sampled networks, such as the degree
distribution, degree-degree correlation, and clustering coefficient, with those
of the original networks. From the numerical results, we find that most
topological properties of the sampled networks are almost the same as those of
the original networks for $\gamma < 3$. In contrast, we find that the degree
distribution exponent of the sampled networks for $\gamma > 3$ deviates somewhat
from that of the original networks when the ratio of the sampled network size
to the original network size becomes smaller. We also apply the sampling method
to various real networks such as the collaboration network of movie actors, the
world wide web, and peer-to-peer networks. All topological properties of the
sampled networks are essentially the same as those of the original real
networks.

PACS numbers: 05.40.Fb, 89.75.Hc, 89.75.Fb

Since the concept of complex networks [1] came into the limelight, many
physically meaningful analyses of complex networks in the real world have
emerged. Examples of such studies include protein-protein interaction networks
(PIN) [2], the world wide web (WWW) [3], email networks [4], etc. Empirical
data on real networks can be collected in various ways, for example, by
traceroutes for the Internet [5] and by high-throughput experiments for protein
interaction maps [6]. It is therefore natural to assume that the empirical data
can be incomplete for various reasons, including limitations of the experiments
and experimental errors or biases. As a result, many real networks that have
been intensively studied so far can be regarded as sampled networks. Moreover,
several studies have shown that dynamical properties on networks can be
significantly affected by the underlying topology [7, 8]. Therefore, it is
important and interesting to study the topological differences between the
sampled networks and the whole networks.

Recently, several sampling methods such as random node sampling [13], random
link sampling, and snowball sampling were studied [10]. Random node sampling is
the simplest method: the sampled network consists of nodes selected randomly
with a given probability p, together with the original links between the
selected nodes. In random link sampling, on the other hand, the links are
selected randomly and the nodes connected to the selected links are kept. These
two random sampling methods have been used for statistical surveys of some
social networks. In the random sampling methods, however, many important nodes
such as hubs are likely to be missed because the selection probability is
uniform. Some recent studies show that for some networks such as PIN, the
topological properties of randomly sampled networks deviate significantly from
those of the original networks [9, 10]. The idea of the snowball sampling
method [10, 11] is similar to the breadth-first search algorithm [12, 13]. In
the snowball sampling method, all the nodes directly connected to a randomly
chosen starting node are selected. Then all the nodes linked to the nodes
selected in the previous step are selected, and so on hierarchically. This
process continues until the sampled network has the desired number of nodes
[10]. Previous studies showed that the topological properties of the sampled
networks depend closely on the sampling method [10].

In this paper, we focus on the effect of weighted sampling on the topological
properties of sampled networks. In order to assign a nontrivial weight to each
node, we first note the structure of real networks. Many real networks are
known to be scale-free networks, in which the degree distribution follows the
power law

$$P(k) \sim k^{-\gamma}. \tag{1}$$

Moreover, the probability $p_v(k)$ that a random walker (RW) visits a node of
degree $k$ [7] is given by

$$p_v(k) \sim k. \tag{2}$$

The degree $k$ thus makes the probability of finding a node by a RW uneven on
heterogeneous networks. By using the RW for sampling, we therefore
automatically assign to each node a nontrivial weight that is proportional to
the degree of the node. Owing to the uneven visiting probability, nodes with
large degree, i.e., the topologically important parts, can be found easily
regardless of the starting position of the RW. We thus expect that sampling by
the RW provides a more effective way to obtain sub-networks that have almost
the same statistical properties as the original one. Furthermore, we study the
effects of the heterogeneity of the original networks on the RW sampling method
(RWSM) by changing $\gamma$. This weighted sampling method is also shown to be
successfully applied to obtain important information on many real networks such
as the actor network, the WWW, and peer-to-peer (P2P) networks. Therefore, we
expect that this study can provide better insight into important properties of
real networks and offer a systematic approach to the sampling of networks with
various $\gamma$.
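
The degree-proportional visiting probability of Eq. (2) can be checked
numerically on a toy graph. The sketch below is illustrative only: the function
name `visit_frequencies`, the five-node graph, and the number of steps are
assumptions made here, not taken from the paper. On a connected undirected
graph the long-run visit frequency of a node approaches its degree divided by
the sum of all degrees.

```python
import random
from collections import Counter

def visit_frequencies(adj, steps, seed=0):
    """Walk `steps` steps on an undirected graph and count visits per node."""
    rng = random.Random(seed)
    node = rng.choice(list(adj))          # random starting node
    visits = Counter()
    for _ in range(steps):
        node = rng.choice(adj[node])      # hop to a uniformly chosen neighbor
        visits[node] += 1
    return visits

# Hypothetical toy graph: node 0 is a hub; the link 1-2 closes a triangle.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
steps = 200_000
visits = visit_frequencies(adj, steps)
total_degree = sum(len(nbrs) for nbrs in adj.values())

for node in sorted(adj):
    k = len(adj[node])
    # columns: node, degree, measured frequency, k / (sum of degrees)
    print(node, k, round(visits[node] / steps, 3), round(k / total_degree, 3))
```

The measured frequencies should track the last column closely, so the hub is
visited roughly four times as often as a node of degree one.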

We now introduce the RWSM. First, we generate original scale-free networks
(SFN) using the static model in Ref. [14], from which sub-networks of various
sizes are sampled. The size, i.e., the number $N_o$ of nodes, of the original
network for each $\gamma$ is set to $N_o = 10^6$. The typical values of
$\gamma$ used in the simulations are $\gamma = 2.23$, $2.51$, $3.05$, $3.40$,
and $4.2$. We set the average degree $\langle k \rangle = 4$ for each network.
After the original networks are prepared, a RW is placed at a randomly chosen
node and moves until it visits $N_s$ distinct nodes. We then construct the
sub-network from these $N_s$ visited nodes and the links that connect any pair
of the $N_s$ visited nodes in the original network. We use $N_s = 10^3$,
$10^4$, $2 \times 10^4$, $4 \times 10^4$, $6 \times 10^4$, $8 \times 10^4$,
$10^5$, and $1, 2, 3, \ldots, 9 \times 10^5$.
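
The sampling step described above can be sketched as follows. This is a minimal
illustration, not the authors' code: the name `random_walk_sample`, the
adjacency-dictionary representation, and the ring used as input are assumptions
made here. The sketch also assumes that the walker starts in a connected
component containing at least `n_sample` nodes; the text does not say how
isolated nodes or small components are handled.

```python
import random

def random_walk_sample(adj, n_sample, seed=None):
    """Induced subgraph on the first `n_sample` distinct nodes a walker visits.

    `adj` maps each node to a list of its neighbors (undirected graph).
    Assumes the start node lies in a component with >= n_sample nodes;
    otherwise the loop below would never terminate.
    """
    if n_sample > len(adj):
        raise ValueError("n_sample exceeds the number of nodes")
    rng = random.Random(seed)
    node = rng.choice(list(adj))          # randomly chosen starting node
    visited = {node}
    while len(visited) < n_sample:
        node = rng.choice(adj[node])      # one random-walk step
        visited.add(node)
    # Keep every original link whose two end nodes were both visited,
    # including links the walker never actually traversed.
    return {u: [v for v in adj[u] if v in visited] for u in visited}

# Hypothetical input: a ring of 10 nodes.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
sub = random_walk_sample(ring, 5, seed=1)
print(sorted(sub))
```

Returning the induced subgraph rather than only the traversed links follows the
construction in the text, where all original links among the visited nodes are
kept.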

The degree distribution is one of the most important measures of the
heterogeneity of networks [1]. In Fig. 1 we compare the degree distributions of
the sampled networks with those of the original networks for various $\gamma$.
We find that the degree distribution of the sampled network also satisfies a
power law, $P(k) \sim k^{-\gamma_s}$.
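
One simple way to estimate a degree exponent such as $\gamma_s$ is a
least-squares line through $\log P(k)$ versus $\log k$. The sketch below
assumes this plain linear fit; the helper names `degree_distribution` and
`fit_exponent` and the fitting window arguments `k_min` and `k_max` are
hypothetical. A raw log-log fit is sensitive to the noisy large-$k$ tail, so in
practice the fitting window needs care.

```python
import math
from collections import Counter

def degree_distribution(adj):
    """Return {k: P(k)} for an adjacency dict, skipping isolated nodes."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n_nodes = len(adj)
    return {k: c / n_nodes for k, c in counts.items() if k > 0}

def fit_exponent(pk, k_min=1, k_max=math.inf):
    """Least-squares slope of log P(k) against log k; returns -slope."""
    pts = [(math.log(k), math.log(p)) for k, p in pk.items()
           if k_min <= k <= k_max and p > 0]
    if len(pts) < 2:
        raise ValueError("need at least two degrees in the fitting window")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx

# Synthetic check on an exact (unnormalized) power law with exponent 2.5;
# the normalization only shifts the line and does not affect the slope.
pk = {k: k ** -2.5 for k in range(1, 200)}
print(round(fit_exponent(pk), 6))
```

Applying `fit_exponent(degree_distribution(sub))` to a sampled sub-network and
to the original network gives the two exponents that are being compared.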

In particular, from the data in Figs. 1(a)-(d) we find that $\gamma_s$ of the
sampled networks with $N_s/N_o > 0.01$ is nearly equal to $\gamma$ of the
original network, even though $\gamma_s$ for small $N_s$ has a relatively large
error bar. This shows that sampling by the RW does not change the heterogeneity
in degree for networks with $2 < \gamma < 3$. Since most real networks have
$2 < \gamma < 3$ [1], this result is practically important. We summarize the
values of $\gamma_s$ obtained for various $N_s$ and $\gamma$ in Table I.

In contrast to the case $\gamma < 3$, $\gamma_s$ for $\gamma > 3$ deviates
slightly from the $\gamma$ of the original networks if $N_s/N_o < 0.1$. (See
the data for $\gamma = 4.2$ in Figs. 1(e) and (f) or in Table I.) Numerically
we find that $\gamma_s$ is nearly equal to the original $\gamma$ for
$N_s/N_o > 0.1$ when $\gamma < 4.2$. Of course, one can expect a substantial
deviation of $\gamma_s$ from $\gamma$ as $\gamma$ increases beyond
$\gamma = 4.2$.

This $\gamma$-dependent behavior of $P(k)$ can be understood from Eqs. (1) and
(2). Equation (1) indicates that $\langle k^2 \rangle$ diverges while
$\langle k \rangle$ remains finite for $\gamma < 3$. This implies that for
$\gamma < 3$ the network has several dominant hubs with extraordinarily large
degree. Since the visiting probability of the RW follows Eq. (2), the RW can
more effectively find the central part of the network around the hubs when
$\gamma < 3$. Thus the sampled networks easily inherit the topological
properties of the original networks.
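
The divergence argument can be made explicit in a continuum approximation. The
following is a sketch under stated assumptions: a pure power law
$P(k) = (\gamma - 1)\, k_{\min}^{\gamma - 1} k^{-\gamma}$ on
$k_{\min} \le k \le k_{\max}$ with $k_{\max} \gg k_{\min}$. The cutoffs
$k_{\min}$ and $k_{\max}$ are illustrative symbols introduced here and do not
appear in the text.

```latex
\begin{align*}
\langle k \rangle
  &\simeq (\gamma-1)\,k_{\min}^{\gamma-1}\int_{k_{\min}}^{k_{\max}} k^{1-\gamma}\,dk
  \;\xrightarrow{\;k_{\max}\to\infty\;}\;
  \frac{\gamma-1}{\gamma-2}\,k_{\min}
  \qquad (\gamma > 2), \\
\langle k^{2} \rangle
  &\simeq (\gamma-1)\,k_{\min}^{\gamma-1}\int_{k_{\min}}^{k_{\max}} k^{2-\gamma}\,dk
  = \frac{\gamma-1}{3-\gamma}\,k_{\min}^{\gamma-1}
    \left(k_{\max}^{3-\gamma}-k_{\min}^{3-\gamma}\right)
  \qquad (\gamma \neq 3).
\end{align*}
```

For $2 < \gamma < 3$ the second moment grows as $k_{\max}^{3-\gamma}$ with the
largest degree while the mean stays finite, whereas for $\gamma > 3$ it
converges to $(\gamma-1)\,k_{\min}^{2}/(\gamma-3)$. Combined with Eq. (2), the
mean degree of a node visited by the walker is
$\langle k^2 \rangle / \langle k \rangle$, which is dominated by the hubs in
the first regime.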

FIG. 1: (Color online) Degree distributions for sampled networks of static
scale-free networks with (a) $\gamma = 2.51$, (c) $\gamma = 3.05$, and (e)
$\gamma = 4.2$. Degree exponents $\gamma_s$ for the sampled networks extracted
from the original network of size $N_o = 10^6$ with (b) $\gamma = 2.51$, (d)
$\gamma = 3.05$, and (f) $\gamma = 4.2$. The slopes of the solid lines in
(a,c,e) and the values of the dashed lines in (b,d,f) are the degree exponents
of the original networks.

The RWSM is also applied to real networks. In Fig. 2 we show $P(k)$ of the
actor network [15], the WWW [3], and the P2P network (Gnutella) [16]. The
numbers of nodes in the original real networks are $N_o = 392340$, $325729$,
and $1074843$ for the actor network, the WWW, and Gnutella, respectively. The
degree distributions of the actor network and the WWW follow a power law with
$\gamma = 2.2$ (actor) [15] and $\gamma = 2.6$ (WWW) [3]. The data in
Fig. 2(a) show that $P(k)$ of the sampled actor network follows a power law
with $\gamma_s = 2.15$ for $N_s > 10^3$. This value of $\gamma_s$ is quite
close to $\gamma = 2.2$. In contrast, $\gamma_s$ seems to deviate from $\gamma$
of the original network for small $N_s$ ($= 10^3$). However, the degree
exponent $\gamma_s$ for $N_s = 10^3$ still has almost the same value as that of
the original network over one decade ($k = 10 \sim 100$). In the case of the
WWW, $\gamma_s$ of the sampled networks agrees well with $\gamma \simeq 2.6$
even for small $N_s$ ($= 10^3$) [see Fig. 2(b)]. For Gnutella, $P(k)$ of the
original network does not follow the simple power law (1). However, as one can
see in Fig. 2(c), the Gnutella network also has big hubs that cause high
heterogeneity in degree, and the sampled networks show nearly the same degree
distribution as the original one. These results also provide evidence that
nodes with large degrees play an important role in the RWSM.

FIG. 2: (Color online) Degree distributions for sampled networks of three real
networks: (a) collaboration network of movie actors ($N_o = 392340$,
$\gamma = 2.2$) [15], (b) WWW ($N_o = 325729$, $\gamma = 2.6$) [3], and (c)
Gnutella ($N_o = 1074843$) [16]. The slopes of the solid lines in (a) and (b)
are the values of the degree exponents obtained from simple linear fitting of
the degree distributions of the sampled networks.

Another important measure to characterize the topological properties of complex
networks is the degree-degree correlation. Many interesting topological
properties such as self-similarity [17] can be affected by the degree-degree
correlation. The degree-degree correlation can be characterized by
$\langle k_{nn}(k) \rangle$, the average degree of the nearest neighbors of
nodes with degree $k$. If $\langle k_{nn}(k) \rangle$ increases (decreases)
with $k$, the network is characterized as having assortative (disassortative)
mixing. As shown in Fig. 3(a), for the static SFN with $2 < \gamma < 3$ the
original network and the sampled networks all show disassortative mixing. This
can be explained by the dynamical properties of RWs on complex networks. In
networks showing disassortative mixing, a RW on a hub must go through a node of
small $k$ to move to another hub. Thus, many nodes having small $k$ are
connected to the hubs in the sampled networks, and the sampled networks remain
disassortative. If the networks have neutral degree correlation, then the
networks sampled by the RW also show neutral degree correlation. [See
Figs. 3(b) and (c).] In Figs. 3(d)-(f), we plot $\langle k_{nn}(k) \rangle$ of
real networks. The $\langle k_{nn}(k) \rangle$ of the sampled networks show the
same degree correlations as those of the original networks. As shown in
Figs. 3(d)-(f), the degree correlations are assortative, disassortative, and
neutral for the actor, WWW, and Gnutella networks, respectively.

FIG. 3: (Color online) Distributions of $\langle k_{nn} \rangle$ for
sub-networks extracted from the original networks with (a) $\gamma = 2.23$,
(b) $\gamma = 3.05$, and (c) $\gamma = 4.2$. (d) Collaboration network of movie
actors, (e) WWW, and (f) Gnutella.
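
The quantity $\langle k_{nn}(k) \rangle$ can be computed directly from an
adjacency list. The sketch below is illustrative: the function name
`average_neighbor_degree` and the star graph are assumptions made here. For
each node the mean degree of its neighbors is taken, and these values are then
averaged over all nodes that share the same degree $k$.

```python
from collections import defaultdict

def average_neighbor_degree(adj):
    """Return {k: <k_nn(k)>} for an undirected graph given as an adjacency dict."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k == 0:
            continue                      # isolated nodes have no neighbors
        knn_node = sum(len(adj[v]) for v in nbrs) / k
        sums[k] += knn_node
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# Hypothetical example: a star, the extreme disassortative case.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
knn = average_neighbor_degree(star)
print(knn)
```

In the star the leaves (degree 1) see only the hub (degree 4) and the hub sees
only leaves, so the curve decreases with $k$, which is the disassortative
signature described above.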

We also measure the clustering coefficient of the sampled networks. The
clustering coefficient $C_i$ of a node $i$ is defined by

$$C_i = \frac{2 y_i}{k_i (k_i - 1)}, \tag{3}$$

where $k_i$ is the degree of node $i$ and $y_i$ is the number of connections
between its nearest neighbors. $C_i$ physically means the fraction of connected
pairs among all pairs of node $i$'s neighbors. $C_i$ is one if all neighbors
are completely connected, whereas $C_i$ becomes zero on an infinite-sized
random network.
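
Equation (3) translates into a few lines of code. The sketch below is
illustrative and makes three assumptions not stated in the text: the graph is
simple (no self-loops or multiple links), $C_i$ is set to zero for nodes with
fewer than two neighbors, and $C(k)$ is taken as the average of $C_i$ over all
nodes of degree $k$. The function names are hypothetical.

```python
from collections import defaultdict

def local_clustering(adj, i):
    """C_i = 2 * y_i / (k_i * (k_i - 1)) for node i of a simple undirected graph."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0                        # convention assumed here for k_i < 2
    nbr_set = set(nbrs)
    # Every neighbor-neighbor link is seen from both of its ends, so halve.
    y = sum(1 for u in nbrs for v in adj[u] if v in nbr_set) / 2
    return 2.0 * y / (k * (k - 1))

def clustering_by_degree(adj):
    """Return {k: C(k)}, the mean of C_i over all nodes with degree k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, nbrs in adj.items():
        sums[len(nbrs)] += local_clustering(adj, i)
        counts[len(nbrs)] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(local_clustering(adj, 0))
print(clustering_by_degree(adj))
```

Node 0 has three neighbors with one link among them, so one of its three
neighbor pairs is connected, while nodes 1 and 2 each have their single
neighbor pair connected.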

In Fig. 4 we plot the clustering coefficient $C(k)$ against the degree $k$.
$C(k)$ is known to reflect the modular structure of networks [19]. $C(k)$ does
not depend on $k$ if the network does not have any well-defined hierarchical
modules [19, 20]. As shown in Fig. 4, $C(k)$ of both the original networks and
the sampled networks tends to decrease with increasing $k$ for the SFN with
$\gamma = 2.23$ and for the real networks. [See Figs. 4(a) and (d)-(f).] This
implies that the sampled networks have the same modular structure as the
original networks. On the other hand, the topology of networks with
$\gamma \geq 3$ closely resembles that of a random graph, and thus $C(k)$ does
not depend on the degree $k$ [20]. The dependence of $C(k)$ on $k$ for the
sampled SFNs with $\gamma > 3$ is also nearly the same as for the original
SFNs. [See Figs. 4(b) and (c).]



We have studied the topological properties of networks sampled by the RWSM from
SFNs and several real networks. From the numerical simulations, we find that
$P(k)$ of the sampled network follows a power law,
$P(k) \sim k^{-\gamma_s}$. We also find that $\gamma_s \simeq \gamma$ for all
$N_s$ when $2 < \gamma < 3$. Even though $\gamma_s$ increases somewhat as $N_s$
decreases for $\gamma > 3$, the values of $\gamma_s$ with $N_s/N_o > 0.1$ still
follow the original one. We also study the degree-degree correlation and the
clustering coefficient by measuring $\langle k_{nn}(k) \rangle$ and $C(k)$. The
sampled networks have the same degree correlation and modular structure as the
original networks for all values of $\gamma$. The RWSM is also applied to the
actor, WWW, and Gnutella networks. By measuring $P(k)$,
$\langle k_{nn}(k) \rangle$, and $C(k)$, we confirm that the topological
properties are well maintained after sampling and that the RWSM is an efficient
sampling method for real networks.

The $\gamma$-dependent behavior of the sampled networks can be understood from
the dynamical properties of a RW. Since most networks in the real world have
$2 < \gamma < 3$, the results are of practical importance. Based on our
results, we expect that if empirical data are obtained by weighted sampling in
which the weight is proportional to the degree, then the sampled networks share
the same topological properties as the whole network. In particular, the
weighted sampling method becomes more efficient as the heterogeneity of the
network increases. At the same time, we expect that our study can provide a
systematic way to extract sub-networks from empirical data and to study various
dynamical properties of real networks [21].